Visual Detective Assignment

Visual Detective R Assignment

This assignment attempts to solve the 2021 IEEE Visual Analytics Science and Technology (VAST) Challenge: Mini-Challenge 2 by applying different visual analytics concepts, methods, and techniques with relevant R data visualisation and data analysis packages.

Archie Dolit https://www.linkedin.com/in/adolit/ (School of Computing and Information Systems, Singapore Management University)
07-25-2021

1. Overview

The VAST Challenge 2021 is a rerun of the VAST Challenge 2014, with the same storyline about the fictitious island country of Kronos, the company GAStech, and the incident involving missing GAStech employees. However, the data for VAST Challenge 2021 were modified and new questions were introduced.

Using the two weeks’ worth of data leading up to the disappearance of the GAStech employees, the goal of Mini-Challenge 2 is to analyze the movement and tracking of company vehicles from GPS data. Together with GAStech employees’ credit card transactions and Kronos Kares loyalty card data, the main objectives are:

2. Literature Review

2.1 Data Understanding

The 2021 data was compared against 2014 data using the diffr package to better understand the variations of VAST Challenge 2021 from the previous year’s challenge.

library(diffr)
diffr("data/aspatial/car-assignments.csv", "data_2014/aspatial/car-assignments.csv")

diffr("data/aspatial/loyalty_data.csv", "data_2014/aspatial/loyalty_data.csv")

diffr("data/aspatial/cc_data.csv", "data_2014/aspatial/cc_data.csv")

library(tools)
md5sum("data/aspatial/gps.csv") == md5sum("data_2014/aspatial/gps.csv")
data/aspatial/gps.csv 
                 TRUE 
(Figure: 2021 VAST Challenge Map and 2014 VAST Challenge Map, side by side)
md5sum("data/aspatial/MC2-tourist.jpg") == md5sum("data_2014/aspatial/MC2-tourist.jpg")
data/aspatial/MC2-tourist.jpg 
                         TRUE 
md5sum("data/Geospatial/Abila.dbf") == md5sum("data_2014/Geospatial/Abila.dbf")
md5sum("data/Geospatial/Abila.kml") == md5sum("data_2014/Geospatial/Abila.kml")
md5sum("data/Geospatial/Abila.prj") == md5sum("data_2014/Geospatial/Abila.prj")
md5sum("data/Geospatial/Abila.sbn") == md5sum("data_2014/Geospatial/Abila.sbn")
md5sum("data/Geospatial/Abila.sbx") == md5sum("data_2014/Geospatial/Abila.sbx")
md5sum("data/Geospatial/Abila.shp") == md5sum("data_2014/Geospatial/Abila.shp")
md5sum("data/Geospatial/Abila.shx") == md5sum("data_2014/Geospatial/Abila.shx")

md5sum("data/Geospatial/Kronos Island.kmz") == md5sum("data_2014/Geospatial/Kronos Island.kmz")
md5sum("data/Geospatial/Kronos_Island.dbf") == md5sum("data_2014/Geospatial/Kronos_Island.dbf")
md5sum("data/Geospatial/Kronos_Island.prj") == md5sum("data_2014/Geospatial/Kronos_Island.prj")
md5sum("data/Geospatial/Kronos_Island.sbn") == md5sum("data_2014/Geospatial/Kronos_Island.sbn")
md5sum("data/Geospatial/Kronos_Island.sbx") == md5sum("data_2014/Geospatial/Kronos_Island.sbx")
md5sum("data/Geospatial/Kronos_Island.shp") == md5sum("data_2014/Geospatial/Kronos_Island.shp")
md5sum("data/Geospatial/Kronos_Island.shx") == md5sum("data_2014/Geospatial/Kronos_Island.shx")

2.2 Guide Questions

VAST Challenge 2014 focused on ‘Patterns of Life’ analysis. It asked about the common daily routines of GAStech employees and what a day in the life of a typical GAStech employee looks like.

In contrast, VAST Challenge 2021 asks participants to infer the owner of each credit card and loyalty card, since the employee names were replaced by the last 4 digits of the credit or debit card number and a unique 5-character loyalty card number.

Nevertheless, both challenges ask about unusual events, anomalies, and evidence of suspicious activities.

2.3 Visualisation Approaches

A review of the VAST Challenge 2014 submissions identified several approaches that remain relevant to the current challenge and are reproducible with R data visualisation and data analysis packages.

Mini-Challenge 2 emphasizes geospatial-temporal data analysis combined with the financial data from credit card and loyalty card transactions. The common approach across several submissions was to highlight the roadway paths of each car and indicate the position-time relationship. The figure below from Peking University, recipient of the Excellent Comprehensive Visual Analysis System Award, shows an example geospatial-temporal visualisation.

This example of movement data visualisation can be achieved using the sf, raster, readr, clock and tmap packages. It can also be improved with an interactive map and tooltip information.

The heatmap visualisation below from Central South University, recipient of the Outstanding Visualization and Analysis Award, shows the credit card transactions of general staff, which can also be used to identify the most popular spots and when they are popular.

This example of heatmap visualisation can be achieved using the ggplot2 and plotly packages. It can also be improved with interactive tooltip information.

The concept of a ‘Point of Interest’ (POI) from Virginia Tech, recipient of the Effective Presentation Honorable Mention, marks a location with a diameter of 50 meters where an employee stops for more than 5 minutes. The POI concept can be used to correlate the GPS tracking data, credit and debit card transactions, and loyalty card data, helping to identify the owners of the credit cards and loyalty cards.

This example of POI data visualisation can be achieved by reusing the geospatial-temporal packages.

3. Methodology

3.1 Install and Launch R Packages

The code chunk below is used to install and load the packages.

packages = c('ggiraph', 'plotly','lobstr',
             'raster','sf', 'tmap', 
             'igraph', 'tidygraph', 
             'ggraph', 'visNetwork', 
             'lubridate', 'clock',
             'widyr', 'wordcloud',
             'ggwordcloud', 'DT',
             'textplot', 'hms',
             'timetk','tidyverse')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3.2 Import Data

Import the csv files into R using read_csv() of readr package.

car_data <- read_csv("data/aspatial/car-assignments.csv")
cc_data <- read_csv("data/aspatial/cc_data.csv")
loyalty_data <- read_csv("data/aspatial/loyalty_data.csv")
gps_data <- read_csv("data/aspatial/gps.csv")

glimpse(car_data)
Rows: 44
Columns: 5
$ LastName               <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName              <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(cc_data)
Rows: 1,490
Columns: 4
$ timestamp  <chr> "01/06/2014 07:28", "01/06/2014 07:34", "01/06/20~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty_data)
Rows: 1,392
Columns: 4
$ timestamp  <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
glimpse(gps_data)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id        <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~

Produce a georeferenced tif file called abila_map.tif from the tourist map MC2-tourist.jpg and the Abila shapefiles using QGIS, an external open-source geographic information system (GIS) software.

Import abila_map.tif into R using raster() of the raster package.

bgmap <- raster("data/Geospatial/abila_map.tif")

tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255)

3.3 Prepare Data

There are several employees sharing the same last name or the same first name. Create a new column FullName by combining FirstName and LastName to serve as a unique employee identifier, using mutate() of the dplyr package.

Additionally, rename the columns CurrentEmploymentType to Department and CurrentEmploymentTitle to Title using rename(). Finally, convert the CarID field from numerical to factor data type.

car_data <- car_data %>%
  #concatenate first and last name
  mutate(FullName = paste(FirstName, LastName, sep = " ")) %>%
  rename(Department = CurrentEmploymentType) %>%
  rename(Title = CurrentEmploymentTitle)

car_data$CarID <- as_factor(car_data$CarID)

glimpse(car_data)
Rows: 44
Columns: 6
$ LastName  <chr> "Calixto", "Azada", "Balas", "Barranco", "Baza", "~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "Isak", "Linnea~
$ CarID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
$ Department <chr> "Information Technology", "Engineering", "Engineer~
$ Title     <chr> "IT Helpdesk", "Engineer", "Engineer", "SVP/CFO", ~
$ FullName  <chr> "Nils Calixto", "Lars Azada", "Felix Balas", "Ingr~

Katerina’s Cafe causes an error when plotting a graph because of the special characters in its name. Convert the special characters into a plain string using the mutate() and str_detect() functions.

Additionally, convert the timestamp field from character to date-time format using date_time_parse() of the clock package, then derive the date, day of the week, and hour of each transaction.

#detect and replace Katerina to Katerina's Cafe
cc_data <- cc_data %>%
    mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

#convert to date-time format
cc_data$date <- date_time_parse(cc_data$timestamp,
                zone = "",
                format = "%m/%d/%Y")
cc_data$day <- wday(cc_data$date,
                          label = TRUE,
                          abbr = TRUE)

cc_data$timestamp <- date_time_parse(cc_data$timestamp,
                zone = "",
                format = "%m/%d/%Y %H:%M")

cc_data$hour <- get_hour(cc_data$timestamp)

glimpse(cc_data)
Rows: 1,490
Columns: 7
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ date       <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day        <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ hour       <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~

Similar to cc_data, convert the special characters of Katerina’s Cafe into a plain string, convert the timestamp from character to date-time format, then derive the date and day of the week of each transaction. Note that loyalty_data does not include the hour and minutes of the transaction.

#detect and replace Katerina to Katerina's Cafe
loyalty_data <- loyalty_data %>%
    mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

#convert to date-time format
loyalty_data$date <- date_time_parse(loyalty_data$timestamp,
                zone = "",
                format = "%m/%d/%Y")

loyalty_data$timestamp <- date_time_parse(loyalty_data$timestamp,
                zone = "",
                format = "%m/%d/%Y")

loyalty_data$day <- wday(loyalty_data$timestamp,
                          label = TRUE,
                          abbr = TRUE)

glimpse(loyalty_data)
Rows: 1,392
Columns: 6
$ timestamp  <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
$ date       <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day        <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~

Rename Timestamp to timestamp and id to CarID for consistency with the other data frames. Similar to cc_data and loyalty_data, convert the timestamp from character to date-time format using date_time_parse(), then derive the date and day of the week.

Convert CarID field from numerical to factor data type. Lastly, convert the gps data frame into a simple feature data frame using st_as_sf() of sf package.

#rename columns for consistency
gps_data <- gps_data %>%
  rename(timestamp = Timestamp) %>%
  rename(CarID = id)

#convert to date-time format
gps_data$date <- date_time_parse(gps_data$timestamp,
                zone = "",
                format = "%m/%d/%Y")

gps_data$day <- as.factor(wday(gps_data$date,
                          label = TRUE,
                          abbr = TRUE))

gps_data$timestamp <- date_time_parse(gps_data$timestamp,
                zone = "",
                format = "%m/%d/%Y %H:%M:%S")

gps_data$hour <- get_hour(gps_data$timestamp)

#convert to factor data type
gps_data$CarID <- as_factor(gps_data$CarID)


glimpse(gps_data)
Rows: 685,169
Columns: 7
$ timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ CarID     <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
$ date      <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~
$ day       <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ hour      <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
#convert to simple feature 
gps_sf <- st_as_sf(gps_data, 
                   coords = c("long", "lat"),
                       crs= 4326)
gps_sf
Simple feature collection with 685169 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 685,169 x 6
   timestamp           CarID date                day    hour
 * <dttm>              <fct> <dttm>              <ord> <int>
 1 2014-01-06 06:28:01 35    2014-01-06 00:00:00 Mon       6
 2 2014-01-06 06:28:01 35    2014-01-06 00:00:00 Mon       6
 3 2014-01-06 06:28:03 35    2014-01-06 00:00:00 Mon       6
 4 2014-01-06 06:28:05 35    2014-01-06 00:00:00 Mon       6
 5 2014-01-06 06:28:06 35    2014-01-06 00:00:00 Mon       6
 6 2014-01-06 06:28:07 35    2014-01-06 00:00:00 Mon       6
 7 2014-01-06 06:28:09 35    2014-01-06 00:00:00 Mon       6
 8 2014-01-06 06:28:10 35    2014-01-06 00:00:00 Mon       6
 9 2014-01-06 06:28:11 35    2014-01-06 00:00:00 Mon       6
10 2014-01-06 06:28:12 35    2014-01-06 00:00:00 Mon       6
# ... with 685,159 more rows, and 1 more variable:
#   geometry <POINT [°]>

3.4 Join Data

Combine the cc_data and loyalty_data based on the purchase information like the location, date and price of transaction using full_join() of dplyr package. Exclude day and timestamp from loyalty_data since these fields are redundant with cc_data. Rearrange the columns into timestamp, date, day, hour, location, price, last4ccnum, loyaltynum.

#combine based on date, location, price, exclude day and timestamp
cc_loyalty_data <- full_join(cc_data %>% select(-c("day")),
                             loyalty_data %>% select(-c("day","timestamp")), 
                             by = c("date" = "date", 
                                    "location" = "location", 
                                    "price" = "price"))

#get day of the joint data
cc_loyalty_data$day <- wday(cc_loyalty_data$date,
                          label = TRUE,
                          abbr = TRUE)

#rearrange columns
cc_loyalty_data <- cc_loyalty_data %>%
  select("timestamp", "date", "day", "hour", "location", "price", "last4ccnum", "loyaltynum")

glimpse(cc_loyalty_data)
Rows: 1,807
Columns: 8
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ date       <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day        <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ hour       <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ loyaltynum <chr> "L8566", NA, "L8148", "L5553", "L3800", "L2247", ~

The joined financial data contains 1,807 entries. Some entries have a last4ccnum but no loyaltynum, while others have a loyaltynum but no last4ccnum. Additionally, a last4ccnum does not necessarily correspond to only one loyaltynum, which means an owner may use multiple credit or debit cards with a single loyalty card, or vice versa.
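The card-to-loyalty multiplicity noted above can be checked directly by counting distinct loyalty numbers per card. A minimal sketch, using a small illustrative tibble in place of the real cc_loyalty_data (the pairings below are made up):

```r
library(dplyr)
library(tibble)

# Illustrative stand-in for cc_loyalty_data (values are made up)
cc_loyalty_demo <- tibble(
  last4ccnum = c(4795, 4795, 7108, 7108, 6816),
  loyaltynum = c("L8566", "L8566", "L2247", "L9406", NA)
)

# Count the distinct loyalty numbers observed for each credit card
card_loyalty_map <- cc_loyalty_demo %>%
  filter(!is.na(loyaltynum)) %>%
  group_by(last4ccnum) %>%
  summarise(n_loyalty = n_distinct(loyaltynum), .groups = "drop")

# Cards paired with more than one loyalty number warrant a closer look
multi_loyalty <- filter(card_loyalty_map, n_loyalty > 1)
```

Running the same summary on the real cc_loyalty_data would surface every card that pairs with multiple loyalty numbers, and the reverse grouping (by loyaltynum) covers the "vice versa" case.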

Combine the car_data and gps_data based on CarID using left_join() of the dplyr package. Exclude FirstName and LastName from car_data since these fields are redundant with FullName.

#combine based on CarID
car_gps_data <- left_join(gps_data, 
                          car_data %>% select(-c("FirstName", "LastName")),
                          by = "CarID")

glimpse(car_gps_data)
Rows: 685,169
Columns: 10
$ timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ CarID     <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
$ date      <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~
$ day       <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ hour      <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ Department <chr> "Executive", "Executive", "Executive", "Executive"~
$ Title     <chr> "Environmental Safety Advisor", "Environmental Saf~
$ FullName  <chr> "Willem Vasco-Pais", "Willem Vasco-Pais", "Willem ~
car_gps_sf <- left_join(gps_sf,
                        car_data %>% select(-c("FirstName", "LastName")),
                        by = "CarID")

car_gps_sf
Simple feature collection with 685169 features and 8 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 685,169 x 9
   timestamp           CarID date                day    hour
   <dttm>              <fct> <dttm>              <ord> <int>
 1 2014-01-06 06:28:01 35    2014-01-06 00:00:00 Mon       6
 2 2014-01-06 06:28:01 35    2014-01-06 00:00:00 Mon       6
 3 2014-01-06 06:28:03 35    2014-01-06 00:00:00 Mon       6
 4 2014-01-06 06:28:05 35    2014-01-06 00:00:00 Mon       6
 5 2014-01-06 06:28:06 35    2014-01-06 00:00:00 Mon       6
 6 2014-01-06 06:28:07 35    2014-01-06 00:00:00 Mon       6
 7 2014-01-06 06:28:09 35    2014-01-06 00:00:00 Mon       6
 8 2014-01-06 06:28:10 35    2014-01-06 00:00:00 Mon       6
 9 2014-01-06 06:28:11 35    2014-01-06 00:00:00 Mon       6
10 2014-01-06 06:28:12 35    2014-01-06 00:00:00 Mon       6
# ... with 685,159 more rows, and 4 more variables:
#   geometry <POINT [°]>, Department <chr>, Title <chr>,
#   FullName <chr>

The joined geospatial data reveals that some CarIDs cannot be mapped to specific employees. Most probably these are the truck drivers, who have no specific car assignment.
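The unmapped CarIDs can be listed explicitly by filtering for rows where the join produced no employee name. A minimal sketch, with a small illustrative tibble standing in for the real car_gps_data:

```r
library(dplyr)
library(tibble)

# Illustrative stand-in for the joined car_gps_data (values are made up)
car_gps_demo <- tibble(
  CarID    = factor(c(1, 1, 35, 104, 105)),
  FullName = c("Nils Calixto", "Nils Calixto", "Willem Vasco-Pais", NA, NA)
)

# CarIDs that did not match any row of car-assignments.csv
unmatched_ids <- car_gps_demo %>%
  filter(is.na(FullName)) %>%
  distinct(CarID)
```

Applied to the real joined data, this yields the set of vehicle IDs with no employee assignment, which can then be cross-checked against the truck-driver hypothesis.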

3.5 Proposed Solutions

Q1: Popular Locations and Anomalies

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

Generate an interactive bar graph in descending order using ggplot and plotly to determine the most popular locations.

popular_combine <- cc_loyalty_data %>%
  group_by(location) %>%
  summarize(total_count=n()) %>%
  ggplot(aes(x=reorder(location, total_count),
             y=total_count,
             text = paste("Location :", location,"\n",
                          "Number of transactions:", total_count))) +
  geom_bar(stat="identity", fill = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions") +
  labs(x = "Locations", y = "Transaction Count") + 
  coord_flip() +
  theme_minimal()

ggplotly(popular_combine, tooltip = "text")

Based on the combined credit card and loyalty data, the most popular location is Katerina’s Cafe with a total of 256 transactions, followed by Hippokampos with 213 transactions and Guy’s Gyro with 187 transactions.

Generate an interactive heatmap using ggplot and plotly to determine the date and time when employees visit the locations.

day_location_count <- cc_loyalty_data %>%
  count(location, day) %>%
  rename(count = n)

popular_day_location <- ggplot(data = day_location_count,
                               aes(x=day, y=reorder(location, desc(location)),
                                   fill = count,
                                   text = paste("Location :", location,"\n",
                                                "Day of week:", day,"\n",
                                                "Number of transactions :", count))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions by Day") +
  labs(x = "Day of the Week",y = "Locations") + 
  theme_minimal()

ggplotly(popular_day_location, tooltip = "text")

Based on the combined credit card and loyalty data, Brew’ve Been Served is popular on weekdays, Monday to Friday, with no transactions on weekends. Guy’s Gyro, Hippokampos, and Katerina’s Cafe are popular throughout the week.

hour_location_count <- cc_loyalty_data %>%
  count(location, hour) %>%
  rename(count = n)
  
popular_hour_location <- ggplot(data = hour_location_count,
                               aes(x=hour, y=reorder(location, desc(location)),
                                   fill = count,
                                   text = paste("Location :", location,"\n",
                                                "Hour of the Day:", hour,"\n",
                                                "Number of transactions :", count))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions by Hour of Day") +
  labs(x = "Hour of the Day",y = "Locations") + 
  theme_minimal()

ggplotly(popular_hour_location, tooltip = "text")

Based on the time of transaction, Brew’ve Been Served and Hallowed Grounds are popular in the morning around 7AM and 8AM.

Abila Zacharo, Bean There Done That, Brewed Awakenings, Gelatogalore, Guy’s Gyro, Hippokampos, Jack’s Magical Beans, Kalami Kafenion, Katerina’s Cafe, and Ouzeri Elian are popular during lunch break, around 12NN to 1PM.

Guy’s Gyro, Hippokampos, and Katerina’s Cafe are popular during dinner, around 7PM and 8PM.

Generate an interactive boxplot using plotly to determine the outliers and provide clues on some anomalies.

outlier <- plot_ly(data = cc_loyalty_data,
                   x = ~price,
                   color = I("royalblue4"),
                   alpha = 0.5,
                   boxpoints = "suspectedoutliers") %>%
  add_boxplot(y = ~reorder(location, desc(location))) %>%
  layout(title = "Combined Credit Card & Loyalty Transactions Outliers",
         yaxis = list(title = "Locations"),
         xaxis = list(title = "Price"))

outlier

Based on the transaction prices, there is an unusually expensive purchase at Frydos Autosupply n’ More amounting to 10,000. This is highly suspicious since the mean price at this location is only 161.96.
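The gap between the outlier and the typical spend can be quantified with a quick summary. A minimal sketch with made-up prices standing in for the real Frydos Autosupply n’ More transactions:

```r
library(dplyr)
library(tibble)

# Illustrative stand-in for the Frydos Autosupply n' More rows (made-up prices)
frydos_demo <- tibble(
  location = "Frydos Autosupply n' More",
  price    = c(85.10, 121.50, 42.70, 10000)
)

# Compare the typical spend against the maximum
frydos_summary <- frydos_demo %>%
  summarise(median_price = median(price),
            mean_price   = mean(price),
            max_price    = max(price))
```

Because a single extreme value inflates the mean, the median is the more robust baseline for judging how far the 10,000 purchase sits from normal activity.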

Generate an interactive line graph using plot_anomaly_diagnostics() of the timetk package to diagnose anomalous points in the cc_data purchase prices. Note that only locations with a sufficient number of observations were selected for the anomaly diagnostics.

cc_data %>%
  filter(location %in% c("Abila Airport",
                         "Albert's Fine Clothing",
                         "Carlyle Chemical Inc.",
                         "Chostus Hotel",
                         "Frydos Autosupply n' More",
                         "Gelatogalore",
                         "Nationwide Refinery",
                         "Stewart and Sons Fabrication")) %>%
  group_by(location) %>%
  plot_anomaly_diagnostics(timestamp, price, 
                           .facet_ncol = 2,
                           .y_lab = "Price")

Based on the anomaly diagnostics, there are unusual purchases at Gelatogalore, Frydos Autosupply n’ More, Albert’s Fine Clothing, and Chostus Hotel. Again, the most expensive purchase is from Frydos Autosupply n’ More, amounting to 10,000 on 2014-01-13 19:20:00.

The anomalies will not be removed or corrected. They will be kept in the data as-is since they may lead to more clues in solving the challenge.

Q2: Anomalies in Vehicle, Credit Card and Loyalty Card Data

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Generate an interactive heatmap using ggplot and plotly based on the amount of transactions with missing last4ccnum.

missing_last4ccnum <- cc_loyalty_data %>%
   filter(is.na(last4ccnum)) 

na_last4ccnum  <- ggplot(data = missing_last4ccnum,
                         aes(x=date, y=reorder(location, desc(location)),
                                   fill = price,
                                   text = paste("Location :", location,"\n",
                                                "Date:", date,"\n",
                                                "Total Amount of Transaction:", price))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Transactions with Missing Credit Card Data by Date") +
  labs(x = "Date of Transaction", y = "Locations") +
  theme_minimal()

ggplotly(na_last4ccnum, tooltip = "text")

Based on the total amount of transactions with missing credit card data, Nationwide Refinery has a transaction on 2014-01-08 with a price of 4,367.63. Stewart and Sons Fabrication has a transaction on 2014-01-13 with a price of 4,071.95 and another on 2014-01-15 with a price of 4,485.38.

The discrepancies may be caused by employees who paid for items at these locations with cash instead of a credit card but still used their loyalty cards.
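Such cash-plus-loyalty candidates can be pulled out by filtering for loyalty-only rows above a price threshold. A minimal sketch; the tibble and the 1,000 threshold are illustrative stand-ins (the loyalty numbers below are made up):

```r
library(dplyr)
library(tibble)

# Illustrative stand-in for cc_loyalty_data (loyalty numbers are made up)
txn_demo <- tibble(
  location   = c("Nationwide Refinery", "Stewart and Sons Fabrication", "Hippokampos"),
  price      = c(4367.63, 4071.95, 12.50),
  last4ccnum = c(NA, NA, 7108),
  loyaltynum = c("L9999", "L8888", "L9406")
)

# Loyalty-only transactions above a hypothetical cash-purchase threshold
cash_candidates <- txn_demo %>%
  filter(is.na(last4ccnum), price > 1000) %>%
  arrange(desc(price))
```

On the real cc_loyalty_data this surfaces the high-value loyalty-only rows, so each one can be examined individually rather than read off the heatmap.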

Generate another interactive heatmap using ggplot and plotly based on the amount of transactions with missing loyaltynum.

missing_loyaltynum <- cc_loyalty_data %>%
   filter(is.na(loyaltynum))

na_loyaltynum  <- ggplot(data = missing_loyaltynum,
                         aes(x=date, y=reorder(location, desc(location)),
                                   fill = price,
                                   text = paste("Location :", location,"\n",
                                                "Date:", timestamp,"\n",
                                                "Total Amount of Transaction:", price))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Transactions with Missing Loyalty Data by Date") +
  labs(x = "Date of Transaction", y = "Locations") +
  theme_minimal()

ggplotly(na_loyaltynum, tooltip = "text")

Based on the total amount of transactions with missing loyalty card data, Frydos Autosupply n’ More has a transaction on 2014-01-13 19:20:00 with a price of 10,000.

This discrepancy is even more suspicious since the person who made the purchase did not use a loyalty card, which may imply possible misuse of the credit card when making the transaction.

Bring in the gps and car data by creating movement paths from the GPS points, using CarID as the unique identifier. Filter the data around the time of the transaction, from 2014-01-13 19:00 to 21:00.

gps_path_0113 <- car_gps_sf %>%
  filter(timestamp >= "2014-01-13 19:00" & timestamp <= "2014-01-13 21:00") %>%
  group_by(CarID, date) %>%
  summarize(m = mean(timestamp), 
            do_union=FALSE) %>%
  st_cast("LINESTRING") 

Plot the GPS paths on the background tourist map and identify which CarIDs are within the vicinity of Frydos Autosupply n’ More.

gps_path_selected_0113 <- gps_path_0113 %>%
  filter(CarID %in% c("13" , "15", "16", "34"))

tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_path_selected_0113) +
  tm_lines() +
  tm_facets(by = "CarID",  ncol = 1)

From the GPS paths, CarIDs 13, 15, 16, and 34 are within the vicinity of Frydos Autosupply n’ More during the suspicious transaction amounting to 10,000.

Create an interactive data table based on the joined gps and car data, filtered to 2014-01-13.

car_gps_0113 <- car_gps_data %>%
  filter(timestamp >= "2014-01-13 19:00" & timestamp <= "2014-01-13 21:00") %>%
  filter(CarID %in% c("13" , "15", "16", "34")) %>%
  group_by(CarID, Department, Title, FullName) %>%
  summarise()

DT::datatable(car_gps_0113)

From the interactive table, all the CarIDs identified belong to the Security Department. Isia Vann and Edvard Vann are possibly relatives, since they share the same last name and both work as Perimeter Controllers.

Q3: Owners of Credit Card and Loyalty Card

Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?

It is challenging to infer the owners of the credit cards and loyalty cards since there is no data field joining the credit card and gps data.

One possible approach is to plot the GPS paths and check them against the timestamps of the credit card transactions. This approach assumes that the person driving the car is the same person making the credit card transaction. Additionally, it assumes that the GPS coordinates and timestamps, as well as the credit card transaction locations and timestamps, are accurate.

To implement this proposed approach, assume that the geospatial tracking software installed in the employees’ cars stops tracking when the vehicle is not moving. Borrowing the concept of a ‘Point of Interest’ (POI) from Virginia Tech, a POI is identified when an employee stops for more than 5 minutes.

Identify the POIs by computing the difference between consecutive gps timestamps. If the difference is greater than 5 minutes, the observation is flagged with poi = TRUE.

gps_poi_sf <- car_gps_sf %>%
  group_by(CarID) %>%
  mutate(diff = timestamp - lag(timestamp, order_by=CarID)) %>%
  mutate(poi = if_else(diff > 60*5,  TRUE, FALSE)) %>%
  filter(poi == TRUE) %>%
  ungroup() 

glimpse(gps_poi_sf)
Rows: 3,067
Columns: 11
$ timestamp <dttm> 2014-01-06 06:53:01, 2014-01-06 07:05:01, 2014-01~
$ CarID     <fct> 4, 35, 4, 10, 34, 26, 20, 19, 18, 12, 32, 33, 3, 7~
$ date      <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~
$ day       <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ hour      <int> 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,~
$ geometry  <POINT [°]> POINT (24.86419 36.07333), POINT (24.87336 3~
$ Department <chr> "Executive", "Executive", "Executive", "Executive"~
$ Title     <chr> "SVP/CFO", "Environmental Safety Advisor", "SVP/CF~
$ FullName  <chr> "Ingrid Barranco", "Willem Vasco-Pais", "Ingrid Ba~
$ diff      <drtn> 1057 secs, 1920 secs, 1320 secs, 1980 secs, 1633 ~
$ poi       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR~

Plot the POIs on the background tourist map and identify the employee locations. From the Combined Credit Card & Loyalty Transactions by Hour of Day heatmap of Question 1, the earliest transaction is around 3:00 AM at Kronos Mart while the last transaction is at 10:00 PM at Hippokampos. This information can be used to limit the number of POIs.

gps_poi_points <- gps_poi_sf %>%
  filter(hour >= 3 & hour <= 23) %>%
  select(timestamp,
         CarID,
         Department,
         Title,
         FullName)

tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_poi_points) +
  tm_dots(col = 'red', border.col = 'black', size = 1, alpha = 0.5, jitter = .8) +
  tm_facets(by = "FullName", ncol = 1)

Create an interactive data table based on the credit card data and infer the owners using the POI plot.

cc_owner <- cc_data %>%
  select(timestamp, location, last4ccnum)

DT::datatable(cc_owner)

Generate a table by manually mapping the credit card transaction purchases against the POI map.

FullName | last4ccnum | Evidence - POI timestamp | Evidence - Corresponding Credit Card Transaction
---|---|---|---
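The manual mapping could also be assisted with a time-window join: pair each POI stop with credit card transactions at nearby times, and treat repeated pairings as evidence of ownership. A minimal sketch with illustrative data; the 10-minute window is an assumption, and the real inputs would be the gps_poi_sf and cc_data built earlier:

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(tibble)

# Illustrative POI stops (made-up values)
poi_demo <- tibble(
  FullName = c("Nils Calixto", "Lars Azada"),
  poi_time = ymd_hm(c("2014-01-06 07:30", "2014-01-06 12:05"))
)

# Illustrative credit card transactions (made-up values)
cc_demo <- tibble(
  last4ccnum = c(4795, 7108),
  cc_time    = ymd_hm(c("2014-01-06 07:28", "2014-01-06 12:55"))
)

# Pair every POI with every transaction, keep pairs within a 10-minute window
candidate_owners <- crossing(poi_demo, cc_demo) %>%
  mutate(gap_mins = abs(as.numeric(difftime(poi_time, cc_time, units = "mins")))) %>%
  filter(gap_mins <= 10) %>%
  select(FullName, last4ccnum, poi_time, cc_time)
```

Across the full two weeks of data, a FullName-last4ccnum pair that recurs on many days would be much stronger evidence of ownership than any single coincidence in time.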

Q4: Relationships among GASTech Personnel

Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships.

Q5: Evidence of Suspicious Activities

Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why.

4. Analysis and Conclusion

5. References

Citation

For attribution, please cite this work as

Dolit (2021, July 25). Visual Analytics & Applications: Visual Detective Assignment. Retrieved from https://adolit-vaa.netlify.app/posts/2021-07-18-assignment/

BibTeX citation

@misc{dolit2021visual,
  author = {Dolit, Archie},
  title = {Visual Analytics & Applications: Visual Detective Assignment},
  url = {https://adolit-vaa.netlify.app/posts/2021-07-18-assignment/},
  year = {2021}
}